Quality of wine depending on components analysis by Francisco Schiappacasse

Univariate Plots Section

We will explore each of the variables in the wine data set. Our main focus is the quality variable, as we want to get an idea of which components of the wine affect its quality.

Quality:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Alcohol:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Density:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Volatile acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Fixed acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

pH:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Free sulfur dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Citric acid:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Residual sugar:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Chlorides:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Sulphates:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
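The six-number summaries above (minimum, quartiles, median, mean, maximum) can be reproduced with a few lines of code. The following is a minimal sketch in Python using only the standard library. The function name `six_number_summary` and the sample scores are illustrative assumptions, not part of the original analysis.

```python
from statistics import mean, quantiles

def six_number_summary(values):
    """Return (min, 1st quartile, median, mean, 3rd quartile, max)."""
    # method="inclusive" interpolates linearly between data points,
    # the same scheme as R's default quantile definition (type 7).
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, mean(values), q3, max(values))

# Illustrative quality scores, NOT taken from the wine data set.
quality = [3, 5, 5, 6, 6, 6, 7, 9]
print(six_number_summary(quality))
```

A mean that sits well above the median, as in the free sulfur dioxide and residual sugar summaries, is a quick hint that the distribution has a long right tail.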

Univariate Analysis

What is the structure of your dataset?

The white wines data set has 4898 observations and 13 variables.

What is/are the main feature(s) of interest in your dataset?

The variable of interest is quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All other variables are characteristics of the wine that should influence its quality.

Did you create any new variables from existing variables in the dataset?

I only added the wine type, in case I need to study the red wines.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the variables follow a roughly normal distribution. I did not perform any operations on the data.

Bivariate Plots Section

In this part we will go further by studying the relationships between pairs of variables. We will start with a matrix of the variables to get a quick idea of how they are related to each other.

Quality by alcohol range:

## ww$alcohol.range: <8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     3.5     4.0     4.0     4.5     5.0 
## ------------------------------------------------------------ 
## ww$alcohol.range: 8-9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.606   6.000   8.000 
## ------------------------------------------------------------ 
## ww$alcohol.range: 9-10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.488   6.000   8.000 
## ------------------------------------------------------------ 
## ww$alcohol.range: 10-11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.864   6.000   9.000 
## ------------------------------------------------------------ 
## ww$alcohol.range: 11-12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   6.000   6.191   7.000   8.000 
## ------------------------------------------------------------ 
## ww$alcohol.range: 12-13
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   7.000   6.571   7.000   9.000 
## ------------------------------------------------------------ 
## ww$alcohol.range: 13-14
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    6.00    7.00    6.72    7.00    8.00 
## ------------------------------------------------------------ 
## ww$alcohol.range: >14
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       7       7       7       7       7       7

Residual sugar by quality:

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## ------------------------------------------------------------ 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## ------------------------------------------------------------ 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## ------------------------------------------------------------ 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## ------------------------------------------------------------ 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## ------------------------------------------------------------ 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## ------------------------------------------------------------ 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

pH by quality:

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## ------------------------------------------------------------ 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## ------------------------------------------------------------ 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## ------------------------------------------------------------ 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## ------------------------------------------------------------ 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## ------------------------------------------------------------ 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## ------------------------------------------------------------ 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410
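Grouped summaries like the ones above, such as quality by alcohol range, come from binning one variable and summarising another within each bin. The following is a rough Python sketch of that idea. The helper names and the sample pairs are illustrative assumptions. Which bin an exact boundary value such as 8.0 falls into is a choice made here, and it may differ from the convention behind the output above.

```python
from collections import defaultdict
from statistics import mean, median

def alcohol_range(alcohol):
    """Map an alcohol percentage to a label such as '<8', '10-11' or '>14'."""
    if alcohol < 8:
        return "<8"
    if alcohol >= 14:
        return ">14"
    low = int(alcohol)  # e.g. 10.4 -> 10, so the label is '10-11'
    return f"{low}-{low + 1}"

def quality_by_range(wines):
    """Group (alcohol, quality) pairs by range; return (median, mean) per range."""
    groups = defaultdict(list)
    for alcohol, quality in wines:
        groups[alcohol_range(alcohol)].append(quality)
    return {label: (median(q), mean(q)) for label, q in groups.items()}

# Illustrative (alcohol, quality) pairs, NOT taken from the wine data set.
wines = [(9.2, 5), (9.8, 6), (10.4, 6), (11.5, 6), (11.9, 7), (12.6, 7), (12.1, 8)]
for label, (med, avg) in quality_by_range(wines).items():
    print(label, med, round(avg, 2))
```

Bins that hold very few wines, such as the lowest and highest alcohol ranges above, give summaries that should be read with caution.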

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is no clear relationship between quality and any single variable. Alcohol is possibly the most useful one, for setting a minimum alcohol level: low alcohol levels are associated with lower-quality wine. Some other variables, such as pH and chlorides, also show interesting ranges. Further recommendations are that volatile acidity should be below 0.6 and free sulfur dioxide below 50 ppm.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are some relationships between alcohol and density, between residual sugar and density, and between fixed acidity and pH.

What was the strongest relationship you found?

The strongest relationship is between quality and alcohol level.

Multivariate Plots Section

In this section we will look at combinations of variables and how they relate to quality. We want to find ranges for the components that affect quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol clearly makes the most difference. Beyond that, some variables show clear ranges in which higher-quality wines are concentrated.

Were there any interesting or surprising interactions between features?

For instance, alcohol and density not only influence quality but also each other. A higher density means a lower alcohol level.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Only a linear regression of quality on alcohol level. Alcohol is the most influential variable, but because the model is linear it predicts that a wine with an alcohol level of 20 would be better than one with an alcohol level of 14, which is not true in reality. The R^2 is also very low.


Final Plots and Summary

Plot One

Description One

Alcohol level has the largest influence on quality. We can see that lower-quality wines are mainly below 12% alcohol.

Plot Two

Description Two

Density also affects quality. Density and alcohol are negatively correlated. It can be seen that quality improves for density levels below 0.9925. Some other variables have a smaller influence, such as pH, chlorides, free sulfur dioxide and total sulfur dioxide. The best ranges for each of the variables are marked on the graph.

Plot Three

Quality, selected subset:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   6.000   7.000   6.652   7.000   8.000

Quality, whole data set:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Description Three

I created a subset of the best wines according to the ranges found in the previous graphs. This selection includes 164 wines. Its median quality is 7 (vs. 6 for the whole data set), and the first and third quartiles also improve by one point.
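A subset of this kind amounts to a filter on several ranges at once. The Python sketch below shows the idea. The density, volatile acidity and free sulfur dioxide thresholds are the ones quoted in this report, while the alcohol cut-off, the field names and the sample wines are illustrative assumptions. The exact ranges used for the 164-wine selection are not listed here.

```python
from statistics import median

def in_preferred_ranges(wine):
    """True if a wine falls inside every range considered favorable."""
    return (
        wine["alcohol"] >= 12.0              # assumed cut-off, for illustration only
        and wine["density"] < 0.9925         # threshold quoted in the report
        and wine["volatile_acidity"] < 0.6   # threshold quoted in the report
        and wine["free_sulfur_dioxide"] < 50 # threshold quoted in the report
    )

# Illustrative wines, NOT taken from the wine data set.
wines = [
    {"alcohol": 12.5, "density": 0.9910, "volatile_acidity": 0.25, "free_sulfur_dioxide": 30, "quality": 7},
    {"alcohol": 12.8, "density": 0.9905, "volatile_acidity": 0.30, "free_sulfur_dioxide": 35, "quality": 8},
    {"alcohol": 13.0, "density": 0.9900, "volatile_acidity": 0.28, "free_sulfur_dioxide": 28, "quality": 6},
    {"alcohol": 9.5, "density": 0.9980, "volatile_acidity": 0.35, "free_sulfur_dioxide": 45, "quality": 5},
    {"alcohol": 10.2, "density": 0.9960, "volatile_acidity": 0.65, "free_sulfur_dioxide": 60, "quality": 5},
    {"alcohol": 12.2, "density": 0.9940, "volatile_acidity": 0.22, "free_sulfur_dioxide": 40, "quality": 6},
]

selected = [w for w in wines if in_preferred_ranges(w)]
print(len(selected), "wines selected")
print("median quality, subset:", median(w["quality"] for w in selected))
print("median quality, all:   ", median(w["quality"] for w in wines))
```

Because the ranges were chosen by looking at the same wines that are then filtered, some of the improvement in the subset is expected by construction. Checking the ranges on wines that were held out would give a fairer picture.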


Reflection

There is no clear way to determine the quality of a wine from all the variables. You can find favorable ranges for some of the variables, but staying within them does not guarantee the best quality.

It could be more conclusive to determine, for example, ranges in which a majority of the wines are of low quality, such as alcohol levels below 10%.
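Such a check could look like the following Python sketch, which computes the share of low-quality wines below a given alcohol level. Treating a quality score of 5 or less as "low quality" is an assumption made for this example, as are the function name and the sample pairs.

```python
def low_quality_share(wines, max_alcohol=10.0, low_quality=5):
    """Fraction of wines below max_alcohol whose quality is at most low_quality.

    Returns None when no wine falls below max_alcohol.
    """
    in_range = [quality for alcohol, quality in wines if alcohol < max_alcohol]
    if not in_range:
        return None
    return sum(q <= low_quality for q in in_range) / len(in_range)

# Illustrative (alcohol, quality) pairs, NOT taken from the wine data set.
wines = [(9.0, 5), (9.4, 5), (9.8, 6), (9.9, 4), (10.5, 6), (11.8, 7), (12.4, 7)]
print(low_quality_share(wines))
```

A share above 0.5 would mean that a majority of the wines in that range are of low quality.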

Additionally, quality is a subjective variable that depends on who grades the wines. This makes it more difficult to conclude which parameters are optimal for the best wine.